Introduction

The code to replicate this report can be found at https://github.com/jtattersall09403/brookes_ml_intro

Background

  • “The task is to take an image of a handwritten digit and determine which of the numbers 0-9 is written in it”
  • Supervised machine learning task
  • Classification-type problem

Classification problem

  • What is a classifcation problem?
  • Subset of supervised learning problems
  • i.e. vector of training data/predictors and a corresponding set of target output values, where our goal is to learn the function that maps the predictors to the target
  • Classification problems are those where the target variable is discrete rather than continuous
    • i.e. where the outcomes we are trying to predict are categories (e.g. faces, flower species, dogs vs cats)
    • As opposed to continuous (e.g. income, age, blood pressure)
  • We do this by training a machine learning algorithm to learn the function that maps training data to target category labels

Training and test sets

  • Our aim is not just to describe the relationships that we identify in a single dataset
  • Also want to be able to predict class membership of new cases
  • That is, we want our algorithm to be able to take new examples it hasn’t seen before, and accurately predict what class that example is
  • One way to evaluate how good our algorithm is at doing this, we need a set of “unseen” data to test it on
  • So we train it on one dataset and test it on another
  • If we don’t do this, we risk over-fitting to the training data
  • But we also want to not only estimate how good our final model will be at predicting from new data (i.e. validate it). We also want to select the model paramters that are bet for this task. If we just see which set of parameters do best on our holdout data (i.e. use it for model selection), this means we can no longer use our holdout data to validate our model, as we have already used it for model selection - so we would risk overfitting to the holdout data
  • In addition, we want our predictions to have low variance - we want it to make consistently good predictions on new datasets
  • So we will use k-fold cross validation. This involves selecting a random subset of the data to train a model with a certain set of parameters, then testing it on the rest of the data. This is repeated a number of times, and then the entire process is repeated for other sets of parameter values. We can then select the parameter value for our final model that has the highest average accuracy across training folds with the lowest variance.
  • To avoid overfitting to the test data, we will also keep an entirely separate set of hold-out data, which will be used to validate the final model (i.e. to estimate how well it is likely to perform on entirely new ‘unseen’ data.)

The MNIST dataset

  • Large datset of images of handwritten digits (put more background on this from website)
  • Labelled 0-9
  • Where each image consists of 28x28 pixels, each of which has a numerical value representing its position on the white-black scale
  • So we have a categorical target variable (the label, which can be one of a discrete set of values), and a vector of predictors (the pixels). This is therefore a classification-type supervised learning problem
  • The dataset has already been split into training and test data
  • Frequency of each digit in the dataset shown below. Ones are most common and fives are least common

Dimensionality Reduction

Principal Component Analysis

  • First algorithm we will try. Classic approach to dimensionality reduction
  • (Explain what PCA does)
  • Plot below shows that the first 2 principal components explain around 17% of the variation in the data
  • Even 20 components explains only around two thirds

  • Despite this, we can still use this model to see some of the similarities and differences between the digits

  • As these two components explain relatively little of the variation, the pciture is not particularly clear
  • However, even so, we can see that zeroes and ones are generally far apart from one another. It seems as though these two digits are very dissimilar to one another when drawn by hand
  • Twos and nines are also fairly dissimilar
  • The fives group overlaps significantly with threes and eights

t-SNE

  • Alternative dimensionality reduction algorithm
  • (Brief explanation of what it is)
  • Has been found in the ltierature to provide a clearer picture than PCA in a variety of contexts, including the MNIST dataset (http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
  • (Describe model training)
  • The plot below shows a sample of the digits plotted according to their two t-SNE component scores

  • This picture is much clearer, with more separation between the digits
  • Zeroes and ones still far apart, as with PCA
  • Fives, eights and threes still close together, with some overlap
  • Other patterns emerge much more clearly (e.g. zeroes closest to sixes and fives, lots of overlap between nines and fours, both of which are similar to sevens)

An example of two different digits occupying the “fives” space can be seen below:

Hand-drawn digits similar to a typical five

Hand-drawn digits similar to a typical five

K-Nearest Neighbour

Training the model

  • K-nearest neighbour predicts class membership by using all predictors to calculate the distance between cases in the data
  • When predicting the class of a particular case, the algorithm finds its k nearest neighbours (with k defined in the model specification), looks at their class membership, finds which class the majority of them belong to, and assigns that class to the case in question
  • A new, unseen case will be compared to the cases the model was trained on, and given a class according to the majority class membership of its k nearest neighbours
  • We can estimate the optimal value for k using the cross-validation technique described in the introduction
  • Due to limited availability of computing power, we will train our model on a subset of the MNIST training data consisting of 2,000 randomly selected cases. We will also employ a random search strategy to identify the optimal value of k through cross validation.
  • We will also use the kappa metric to estimate how much better than chance our model is at estimating class membership
  • The plot below shows average percentage accuracy and kappa values for each model across cross-validation folds, along with the variance in these metrics

  • The highest performing model across cross-validation folds was for k = 41, with an average error rate of 0.168 (or 16.8%) and average kappa = 0.813 (indicating the model performs substantially better than chance, according to common effect size guidelines)
  • The cross validation error rate for this value of k had approximately average variance across folds: its standard deviation for accuracy was 0.02, the same as the average for these models (to 2 decimal places)

Validating the model

  • On the holdout data, our final model had an error rate of 0.14 (or 14%)
  • This was similar to the error rate obtained from cross-validation, suggesting we do not have evidence of overfitting
  • It is substantially higher than the error rates obtained in the literature for k-nn methods, however. Examples include LeCun et al (1988), achieving a 5% error rate, and Kenneth Wilder, achieving a 3% error rate.
  • This is most likely because we used only a very small proportion of the available data to train our model (due to computing power limitations)

The plot below shows how the model’s performance varies across the different digits in the test set.

We can see that our model performs least well on the digits 9 (77% accuracy) and 1 (71% acccuracy). Nines are frequently confused with fours (15% of cases), whilst ones are frequently confused with twos and sevens. This is consistent with the findings from our dimensionality reduction exercise, which showed that these classes were the closest to each other in the feature space.

NB: As these are test set results, they are included here as an estimate of how well this knn model would perform on new data. They will not be used for model selection in the later sections, when comparing performance against a second algorithm. To do so would be to risk overfitting to the test set data.

Dimensionality-reduced K-NN

Owing to the so-called “curse of dimensionality”, the k-nearest neighbour algorithm can suffer when classifying high-dimensional data. This is likely to be exacerbated by the fact that we have had to reduce the size of the training data due to computational limitations. The dimensionality curse results from the fact that, as the number of dimensions increases, the number of data points per axis in the high-dimensional space reduces exponentially. The fact that we’ve had to filter to a subset of cases will mean we have even fewer cases per axis in our 784-dimension space.

To try and mitigate this issue, we will use principle component analysis to reduce the number of dimensions on which we will train our model.

Training the model

The screeplot below shows the results from the principal component analysis. In order to explain more than 80% of the variance in the data, we will retain the first 44 principal components.

Principle Component Analysis screeplot

The knn model was trained and tuned using the same cross-validation procedure as in the previous section. The results are plotted below.

Model training plots

We find that the best value of k is again 41. However, this reduced dimension model performs less well in training than the higher-dimension model: the best model had an accuracy of only 75%.

Support Vector Machine

  • SVMs have had good performance classifying the MNIST dataset in the past (e.g. Le Cun et al, 1998; < 2% error rate)
  • And typically perform well on high-dimensional data (reference)
  • So we will train an SVM model and compare it with knn, tuning the C (cost) parameter to compare different svm model specifications
  • In addition, we noticed in the previous experiment that the error rate increased with the value of k
  • We will therefore retrain our knn model, including k=1 as one of the model specifications
  • Random search repeated 5-fold cross validation used when training both the knn and svm models to compare performance and select the optimum model
  • The test set will only be used once the final model has been selected, to estimate how it would perform on noew unseen data

Training the model

  • Summary of svm
  • Plots show its performance (using Kappa and % Accuracy metrics, as this is a balanced classification task) as compared with knn
Accuracy metrics for knn and svm models

We find that although the svm specifications generally perform better than the knn models, the very best-performing model is knn-1: i.e. the knn model with k=1, achieving an 89.2% accuracy rate (i.e. a 10.8% classification error rate). Although we might expect such a low value of k to result in a model that fits to random noise in the data, the fact that it has such a low error rate in cross validation (and low variation across folds) suggests that this is not the case here. We will therefore select a knn model with k=1 as our final model.

Validating the model

On the holdout test dataset, our final KNN model had an error rate of just 8.7%. The plot below shows how accurately it predicts each numerical class in the holdout data.

We can see that our final model performs least well when classifying nines, on which it has an 85% accuracy rate. Note that it classifies nines as fours 7% of the time, the most common misclassification across all digits. This is in line with what we might expect from our dimensionality reduction stage, which showed that there is a significant amount of overlap between nines and fours in the feature space.

Conclusion

  • Write a brief conclusion on the results and compare to results published for other algorithms on the data set homepage
    • Similar performance level on hold-out data to linear classifiers from Le Cun et al, 1998 (e.g. 1-layer nerual network and pariwise linear classifier)
    • But higher error rate than their results for k-nearest neighbour models
    • Most likely resulting from the fact that we only included 2000 samples in training (due to limits on available computing power)
  • Which approach and parameter value is best suited?
    • Knn with k=1
    • Highest average accuracy (i.e. lowest classification error) and kappa values across cross-validation folds in training, with relatively low variance across folds
  • What other properties than just the classification error could be important to decide which method is most suited?
    • As this was a fairly balanced classification problem (i.e. the frequency of each class in the dataset was roughly equal), classification error is a reasonable method for model selection
    • We also used Kappa, as an indication of how much better than chance each model performed
    • This would be particularly useful for an imbalanced classification task - e.g. on a fraud detection task, where there would be very few members of the positive class, an algorithm could achieve a ery low classification error rate by simply marking all transactions as non-fraudulent (which would not be very helpful!)
    • In such cases, other metrics would be needed. These might include the use of Receiver Operator Curves (and the area under the ROC curve), to select the optimum trade-off between false positives and false negatives. This optimisation would need to take place alongside domain expert knowledge around the relative importance of false negatives and false positives (e.g. a driverless car that brakes more than it needs to is probably more desirable than one that doesn’t break when it does need to!)
    • In other situations where we are more interested in assigning class probabilities than outright class memberships (e.g. if we were a bookmaker estimating the probability of a football player scoring a goal in a match) then we might want to use a log-loss metric to penalise models according to how far their assigned probabilities are from the actual outcome (i.e. “near misses” are penalised less than “severe misses”)
  • Explain possible current limitations of your solutions and possible further strategies to improve on the results
    • Number of samples selected for training: accuracy would almost certainly be improved by using a more powerful machine that could be trained on all of the training samples
    • Tuning models over a larger number of parameters: For computational reasons, we trained a relatively small number of models (i.e. tuned each model over a relatively small parameter range). Fine-tuning each model with different parameters may improve performance
    • Use of convolutional neural networks (which generally require even more computational power, usually implemented via GPU): These tend to have the best performance on this task, from the MNIST literature
    • Pre-processing: results from the literature suggest that performance can be substantially improved through pre-processing techniques such as deskewing, noise removal, blurring and so on, achieving error rates as low as 0.35 for non-convolutional algorithms (and 0.23 for convolutional nets)